Similar resources
Intelligent Wrapping from PDF Documents
Wrapping is the process of navigating a data source, semiautomatically extracting data and transforming it into a form suitable for data processing applications. The semi-structured form of web pages, coupled with the availability of business-relevant data, has led to the availability of several established products on the market for wrapping data from the Web. One such approach is the Lixto me...
Layout and Content Extraction for PDF Documents
Portable document format (PDF) is a common output format for electronic documents. Most PDF documents are untagged and do not have basic high-level document logical structural information, which makes the reuse or modification of the documents difficult. We developed techniques that identified logical components on a PDF document page. The outlines, style attributes and the contents of the logi...
Automatic indexing of PDF documents with ontologies
Indexing large bodies of data is necessary to enable satisfactory search results. Ontologies serve as fixed vocabularies to index data from different viewpoints. We describe how AIDAS, a software tool, automatically divides the source data (PDF documents) into reusable chunks, how it automatically indexes these chunks and stores them in a database to enable reuse.
Optimizing PDF output size of TeX documents
There are several tools for generating PDF output from a TeX document. By choosing the appropriate tools and configuring them properly, it is possible to reduce the PDF output size by a factor of 3 or even more, thus reducing document download times, hosting and archiving costs. We enumerate the most common tools, and show how to configure them to reduce the size of text, fonts, images and cros...
Hiding Malicious Content in PDF Documents
This paper is a proof-of-concept demonstration for a specific digital signatures vulnerability that shows the ineffectiveness of the WYSIWYS (What You See Is What You Sign) concept. The algorithm is fairly simple: the attacker generates a polymorphic file that has two different types of content (text, as a PDF document for example, and image: TIFF – two of the most widely used file formats). Wh...
Journal
Journal title: Information Technology and Libraries
Year: 2008
ISSN: 2163-5226, 0730-9295
DOI: 10.6017/ital.v27i3.3246